Project Name - Ford Bike Sharing Project¶
Project Summary -¶
This project focuses on analyzing individual bike rides made through the Ford GoBike sharing system in the San Francisco Bay Area. The primary objective of this analysis is to uncover user behavior patterns, determine key trends related to trip duration, and identify factors influencing ridership, ultimately providing actionable insights to enhance the efficiency and profitability of the bike-sharing system.
GitHub Link -¶
Problem Statement¶
The Ford GoBike bike-sharing system operates across the San Francisco Bay Area, providing convenient and eco-friendly transportation. However, to optimize operations, improve user experience, and expand its reach, the company needs a deeper understanding of user behavior, trip patterns, and system usage. Without these insights, it is difficult to make data-driven decisions regarding infrastructure expansion, targeted marketing, and service improvements. This project aims to analyze historical trip data to uncover key trends, user preferences, and operational inefficiencies.
Define Your Business Objective?¶
- Analyze Trip Durations -
Determine how long the average trip lasts and identify any outliers.
Compare trip durations across different user types and demographics.
- Understand Temporal Patterns -
Identify peak ride hours, days, and months.
Explore how seasonality impacts ride frequency and duration.
- Segment Users Based on Behavior -
Compare the behavior of subscribers and casual customers.
Understand how gender and age influence riding habits.
- Identify High-Demand Stations -
Locate the most popular start and end stations.
Provide recommendations for station expansion or bike redistribution.
- Improve User Experience -
Identify user groups (e.g., casual riders or underrepresented demographics) for targeted engagement.
Suggest enhancements to service offerings based on usage data.
- Support Strategic Decisions -
Provide data-driven recommendations to increase ridership, reduce operational bottlenecks, and enhance customer satisfaction.
Let's Begin !¶
1. Know Your Data¶
Import Libraries¶
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
Dataset Loading¶
# Load Dataset
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Function to load a CSV file into a pandas DataFrame
def load_csv(file_path):
    try:
        return pd.read_csv(file_path) # Load the CSV file
    except Exception as e:
        print(f"Error: {e}") # Print error message if loading fails
        return None # Return None in case of failure
# Define file paths for the dfs
fordbike_path = '/content/drive/MyDrive/Labmentix_Internship_Projects/Ford_Bike_Sharing Project/fordgobike-tripdata.csv' # File path for the ford-bike df
# Load the df using the load_csv function
fordbike_df = load_csv(file_path=fordbike_path) # Load ford-bike df
# Display all columns
pd.set_option('display.max_columns', None)
Mounted at /content/drive
Dataset First View¶
# Dataset First Look
fordbike_df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75284 | 2018-01-31 22:52:35.2390 | 2018-02-01 19:47:19.8240 | 120 | Mission Dolores Park | 37.761420 | -122.426435 | 285 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 2765 | Subscriber | 1986.0 | Male | No |
| 1 | 85422 | 2018-01-31 16:13:34.3510 | 2018-02-01 15:57:17.3100 | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 2815 | Customer | NaN | NaN | No |
| 2 | 71576 | 2018-01-31 14:23:55.8890 | 2018-02-01 10:16:52.1160 | 304 | Jackson St at 5th St | 37.348759 | -121.894798 | 296 | 5th St at Virginia St | 37.325998 | -121.877120 | 3039 | Customer | 1996.0 | Male | No |
| 3 | 61076 | 2018-01-31 14:53:23.5620 | 2018-02-01 07:51:20.5000 | 75 | Market St at Franklin St | 37.773793 | -122.421239 | 47 | 4th St at Harrison St | 37.780955 | -122.399749 | 321 | Customer | NaN | NaN | No |
| 4 | 39966 | 2018-01-31 19:52:24.6670 | 2018-02-01 06:58:31.0530 | 74 | Laguna St at Hayes St | 37.776435 | -122.426244 | 19 | Post St at Kearny St | 37.788975 | -122.403452 | 617 | Subscriber | 1991.0 | Male | No |
Dataset Rows & Columns count¶
# Dataset Rows & Columns count
print('Number of Rows =',fordbike_df.shape[0])
print('Number of Columns =',fordbike_df.shape[1])
Number of Rows = 94802
Number of Columns = 16
Dataset Information¶
# Dataset Info
fordbike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94802 entries, 0 to 94801
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   duration_sec             94802 non-null  int64
 1   start_time               94802 non-null  object
 2   end_time                 94802 non-null  object
 3   start_station_id         94802 non-null  int64
 4   start_station_name       94802 non-null  object
 5   start_station_latitude   94802 non-null  float64
 6   start_station_longitude  94802 non-null  float64
 7   end_station_id           94802 non-null  int64
 8   end_station_name         94802 non-null  object
 9   end_station_latitude     94802 non-null  float64
 10  end_station_longitude    94802 non-null  float64
 11  bike_id                  94802 non-null  int64
 12  user_type                94802 non-null  object
 13  member_birth_year        86963 non-null  float64
 14  member_gender            87001 non-null  object
 15  bike_share_for_all_trip  94802 non-null  object
dtypes: float64(5), int64(4), object(7)
memory usage: 11.6+ MB
Duplicate Values¶
# Dataset Duplicate Value Count
print('Number of duplicates in dataset =',fordbike_df.duplicated().sum())
Number of duplicates in dataset = 0
Missing Values/Null Values¶
# Missing Values/Null Values Count
print('Number of missing values in dataset =',fordbike_df.isnull().sum().sum())
Number of missing values in dataset = 15640
# Visualizing the missing values
# Calculating the number of missing values per column
missing_values = fordbike_df.isnull().sum()
# Plotting the missing values
fig, ax = plt.subplots(figsize=(16, 6))
bars = ax.bar(missing_values.index, missing_values.values, color='teal')
# Adding data labels on top of the bars
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, f'{yval}',
            ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
# Adding labels and title
plt.title('Missing Values per Column', fontsize=16, fontweight='bold')
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Number of Missing Values', fontsize=12)
# Rotating x-axis labels for better readability
plt.xticks(rotation=45)
# Display the plot
plt.show()
What did you know about your dataset?¶
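This dataset contains information on individual bike-sharing trips made in the greater San Francisco Bay Area. It includes 94,802 entries across 16 columns and has no duplicate rows. The only missing values are in member_birth_year (7,839 rows) and member_gender (7,801 rows), which together account for all 15,640 nulls. The dataset is used to analyze ride patterns, trip durations, and the influence of factors like user type, seasonality, and demographics.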
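Raw null counts are easier to judge when expressed as a share of rows. The sketch below shows one way to do that; it uses a small made-up frame named `toy` (hypothetical values, not the real trip data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fordbike_df (hypothetical values, not real trip data)
toy = pd.DataFrame({
    "duration_sec": [300, 600, 900, 1200],
    "member_birth_year": [1986.0, np.nan, 1996.0, np.nan],
    "member_gender": ["Male", None, "Male", "Female"],
})

# isnull().mean() gives the fraction of nulls per column; scale to percent
missing_pct = toy.isnull().mean().mul(100).round(1)

# Show only the columns that actually have missing values
print(missing_pct[missing_pct > 0])
```

Applied to `fordbike_df`, the same two lines would report the missing share for member_birth_year and member_gender directly.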
2. Understanding Your Variables¶
# Dataset Columns
fordbike_df.columns
Index(['duration_sec', 'start_time', 'end_time', 'start_station_id',
'start_station_name', 'start_station_latitude',
'start_station_longitude', 'end_station_id', 'end_station_name',
'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type',
'member_birth_year', 'member_gender', 'bike_share_for_all_trip'],
dtype='object')
# Dataset Describe
fordbike_df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year |
|---|---|---|---|---|---|---|---|---|---|
| count | 94802.000000 | 94802.000000 | 94802.000000 | 94802.000000 | 94802.00000 | 94802.000000 | 94802.000000 | 94802.000000 | 86963.000000 |
| mean | 870.935930 | 103.766302 | 37.773321 | -122.361677 | 101.00982 | 37.773536 | -122.360776 | 2048.751609 | 1980.932420 |
| std | 2550.596891 | 87.730464 | 0.085744 | 0.105253 | 86.77949 | 0.085552 | 0.104580 | 1091.507513 | 10.803017 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.444293 | 3.00000 | 37.317298 | -122.444293 | 11.000000 | 1900.000000 |
| 25% | 359.000000 | 30.000000 | 37.771662 | -122.412408 | 27.00000 | 37.773063 | -122.411306 | 1133.000000 | 1975.000000 |
| 50% | 555.000000 | 79.000000 | 37.781270 | -122.398773 | 76.00000 | 37.781752 | -122.398436 | 2151.500000 | 1983.000000 |
| 75% | 854.000000 | 160.000000 | 37.795392 | -122.390428 | 157.00000 | 37.795392 | -122.390428 | 3015.000000 | 1989.000000 |
| max | 85546.000000 | 342.000000 | 37.880222 | -121.874119 | 342.00000 | 37.880222 | -121.874119 | 3744.000000 | 2000.000000 |
Variables Description¶
duration_sec - Duration of the trip in seconds
start_time - Date and time when the trip started
end_time - Date and time when the trip ended
start_station_id - Unique ID of the starting station
start_station_name - Name of the starting station
start_station_latitude - Latitude of the starting station
start_station_longitude - Longitude of the starting station
end_station_id - Unique ID of the ending station
end_station_name - Name of the ending station
end_station_latitude - Latitude of the ending station
end_station_longitude - Longitude of the ending station
bike_id - Unique identifier for the bike used
user_type - Type of user: Subscriber (member) or Customer (casual)
member_birth_year - Birth year of the user
member_gender - Gender of the user (Male, Female, or Other)
bike_share_for_all_trip - Indicates whether the trip was part of the Bike Share for All program (Yes/No)
Check Unique Values for each variable.¶
# Check Unique Values for each variable.
print('Unique Values in dataset:\n')
print(fordbike_df.nunique())
Unique Values in dataset:

duration_sec                4512
start_time                 94801
end_time                   94797
start_station_id             273
start_station_name           273
start_station_latitude       273
start_station_longitude      273
end_station_id               272
end_station_name             272
end_station_latitude         272
end_station_longitude        272
bike_id                     3065
user_type                      2
member_birth_year             72
member_gender                  3
bike_share_for_all_trip        2
dtype: int64
3. Data Wrangling¶
Data Wrangling Code¶
# Write your code to make your dataset analysis ready.
# Create copy of the dataset
df = fordbike_df.copy()
# 1. Convert start_time and end_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# 2. Create new columns for date, time, and duration in minutes
df['start_date'] = df['start_time'].dt.date
df['start_hour'] = df['start_time'].dt.hour
df['start_day'] = df['start_time'].dt.day_name()
df['start_month'] = df['start_time'].dt.month_name()
df['duration_min'] = df['duration_sec'] / 60 # Convert seconds to minutes
# 3. Handle missing values
# Dropping rows with missing member_birth_year and member_gender
df = df.dropna(subset=['member_birth_year', 'member_gender'])
# Checking missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum())
Missing values after handling:
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
start_date                 0
start_hour                 0
start_day                  0
start_month                0
duration_min               0
dtype: int64
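# 4. Create age column from birth year
# Note: this is the rider's age as of 2025; the trips shown above date from 2018, so age at ride time is about 7 years lower
df['age'] = 2025 - df['member_birth_year']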
# 5. Remove outliers in trip duration
# This helps to focus analysis on realistic trip lengths
duration_cap = df['duration_min'].quantile(0.99)
df = df[df['duration_min'] <= duration_cap]
# 6. Converting birth year and age to int
df['member_birth_year'] = df['member_birth_year'].astype(int)
df['age'] = df['age'].astype(int)
# 7. Drop irrelevant columns
df.drop(['bike_id', 'start_station_id', 'end_station_id'], axis=1, inplace=True)
# Dataset shape after cleaning
print('Number of Rows =',df.shape[0])
print('Number of Columns =',df.shape[1])
Number of Rows = 86093
Number of Columns = 19
# View final dataset
df.head()
| | duration_sec | start_time | end_time | start_station_name | start_station_latitude | start_station_longitude | end_station_name | end_station_latitude | end_station_longitude | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_date | start_hour | start_day | start_month | duration_min | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 453 | 2018-01-31 23:53:53.632 | 2018-02-01 00:01:26.805 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | Valencia St at 24th St | 37.752428 | -122.420628 | Subscriber | 1988 | Male | No | 2018-01-31 | 23 | Wednesday | January | 7.55 | 37 |
| 7 | 180 | 2018-01-31 23:52:09.903 | 2018-01-31 23:55:10.807 | Berry St at 4th St | 37.775880 | -122.393170 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | Subscriber | 1980 | Male | No | 2018-01-31 | 23 | Wednesday | January | 3.00 | 45 |
| 8 | 996 | 2018-01-31 23:34:56.004 | 2018-01-31 23:51:32.674 | Valencia St at 24th St | 37.752428 | -122.420628 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | Subscriber | 1987 | Male | Yes | 2018-01-31 | 23 | Wednesday | January | 16.60 | 38 |
| 9 | 825 | 2018-01-31 23:34:14.027 | 2018-01-31 23:47:59.809 | Ryland Park | 37.342725 | -121.895617 | San Salvador St at 9th St | 37.333955 | -121.877349 | Subscriber | 1994 | Female | Yes | 2018-01-31 | 23 | Wednesday | January | 13.75 | 31 |
| 11 | 432 | 2018-01-31 23:34:26.484 | 2018-01-31 23:41:39.297 | Division St at Potrero Ave | 37.769218 | -122.407646 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | Subscriber | 1993 | Male | No | 2018-01-31 | 23 | Wednesday | January | 7.20 | 32 |
What all manipulations have you done and insights you found?¶
Manipulations-¶
- Datetime Conversions:
- start_time and end_time columns were converted to proper datetime formats.
- This allowed easy extraction of time-based features.
- Feature Engineering:
- Trip duration in minutes (duration_min) was created by dividing duration_sec by 60.
- New columns were extracted from start_time: start_date (calendar date), start_hour (hour of the day), start_day (day of the week), and start_month (month name).
- Rider age was calculated as 2025 minus member_birth_year.
- Data Type Conversion:
- Numerical fields like member_birth_year and age column were converted into integer datatype.
- Filtering Unusual Records:
- Trips longer than the 99th percentile of duration_min were filtered out, so that a few very long rides (the raw maximum is 85,546 seconds, close to 24 hours) do not distort the analysis.
- Handling Missing Data:
- Checked null values across all columns.
- Rows with nulls in key columns like member_gender or member_birth_year were removed to maintain data quality.
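The duration filter in step 5 of the wrangling code keeps rows at or below a percentile of duration_min. The sketch below illustrates that technique on a made-up frame (`toy`, hypothetical values); it uses the 90th percentile only because the toy data has ten rows, whereas the notebook uses 0.99.

```python
import pandas as pd

# Toy durations in minutes (hypothetical); the last value mimics a ride left running
toy = pd.DataFrame(
    {"duration_min": [5.0, 7.5, 9.0, 12.0, 15.0, 20.0, 25.0, 30.0, 45.0, 1400.0]}
)

# Compute the cap from the data itself rather than hard-coding a limit
cap = toy["duration_min"].quantile(0.90)

# Keep rows at or below the cap and count what was dropped
kept = toy[toy["duration_min"] <= cap]
removed = len(toy) - len(kept)

print(f"cap = {cap:.1f} min, removed {removed} of {len(toy)} rows")
```

A percentile cap always removes a fixed share of rows, whether or not they are implausible, so it is worth reporting the cap value alongside the row counts.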
Insights-¶
- Presence of Outliers:
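- Some records showed very long trip durations: the raw maximum is 85,546 seconds (almost 24 hours) against a median of 555 seconds, indicating the need for cleaning. No negative durations appear; the minimum is 61 seconds.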
- Missing Demographic Data:
- member_birth_year is missing in 7,839 records and member_gender in 7,801 (about 8% of rows each), which could affect gender-based or age-based insights.
- User and Time Patterns Possible:
- By extracting start_hour, start_day, and start_month, patterns in riding behavior over different times and seasons became analyzable.
- Age Distribution Identified:
- Calculating age from birth year allowed for deeper analysis into age groups of riders and their preferences.
- Trip Duration is a Critical Metric:
- Creating duration_min provided the key measure to compare how trip length varies by age, gender, time of day, and user type.
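The wrangling code computes age as 2025 minus birth year, which gives the rider's age as of 2025. If age at the time of the ride is wanted instead, one option is to subtract the birth year from the year of start_time. This is an alternative sketch on a made-up frame (`toy`, and the column name `age_at_trip`, are hypothetical), not what the notebook does.

```python
import pandas as pd

# Toy rows (hypothetical) with the two columns the calculation needs
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-01-31 23:53:53", "2018-01-31 23:52:09"]),
    "member_birth_year": [1988, 1980],
})

# Age in the year the trip started, rather than age relative to a fixed year
toy["age_at_trip"] = toy["start_time"].dt.year - toy["member_birth_year"]

print(toy[["member_birth_year", "age_at_trip"]])
```

Using the trip year keeps the age column meaningful even if the notebook is re-run in a later year.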
4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables¶
# Setting visualization style
plt.style.use('fivethirtyeight')
Univariate Analysis¶
Chart - 1 - Distribution of Trip Duration (in minutes)¶
# Chart - 1 visualization code
# Histogram for trip duration
# Set figure size
plt.figure(figsize=(12, 6))
# Create histogram
sns.histplot(df['duration_min'], bins=50, color='teal', edgecolor='black', kde = True)
# Add labels and title
plt.title('Distribution of Trip Duration (in minutes)',fontsize = 16, fontweight='bold')
plt.xlabel('Duration (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Set font size for x & y axis value
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10)
# Show Plot
plt.show()
1. Why did you pick the specific chart?¶
A histogram clearly shows how trip durations are distributed, helping identify common and extreme values.
2. What is/are the insight(s) found from the chart?¶
Most trips are short (under 20 minutes), indicating the system is used for quick commutes.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, this insight supports optimizing bike availability for short-distance riders and efficient fleet rotation.
Negative Insight - No, it shows healthy usage patterns. Longer trips may need review, but not necessarily negative.
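The "under 20 minutes" reading of the histogram can be backed with a number. The sketch below computes the share of trips below a threshold and two quantiles on a made-up frame (`toy` and `threshold_min` are hypothetical names and values).

```python
import pandas as pd

# Toy trip durations in minutes (hypothetical values)
toy = pd.DataFrame({"duration_min": [3.0, 7.55, 13.75, 16.6, 7.2, 25.0, 40.0, 9.0]})

threshold_min = 20

# Mean of a boolean mask is the fraction of rows where the condition holds
share_short = (toy["duration_min"] < threshold_min).mean()

print(f"{share_short:.1%} of trips are under {threshold_min} minutes")
print(toy["duration_min"].quantile([0.5, 0.9]))
```

Run against `df['duration_min']`, the same lines would turn the visual impression into a reportable percentage.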
Chart - 2 - User Types Distribution¶
# Chart - 2 visualization code
# User Type Distribution
# Set figure size
plt.figure(figsize=(12, 6))
# Counting user types
user_type_counts = df['user_type'].value_counts()
# Create bar plot
sns.barplot(x=user_type_counts.index, y=user_type_counts.values, palette='viridis')
# Add data labels
for i, v in enumerate(user_type_counts.values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
# Add labels and title
plt.title('Bar Chart of User Types', fontsize=16, fontweight='bold')
# Add axis labels
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Number of Riders', fontsize=12)
# Set font size for x & y axis value
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show plot
plt.show()
1. Why did you pick the specific chart?¶
A bar chart is ideal to compare categorical counts such as Subscriber vs Customer.
2. What is/are the insight(s) found from the chart?¶
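The majority of trips are made by Subscribers. Because each row is a trip rather than a unique rider, this points to regular repeat use, which is consistent with strong retention and loyalty.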
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, strong subscriber base helps in revenue predictability and customer lifetime value.
Negative Insight - No, fewer customers indicate potential for growth through short-term promotions.
Chart - 3 - Member Gender Distribution¶
# Chart - 3 visualization code
# Member Gender Distribution
plt.figure(figsize=(12, 6))
sns.barplot(x=df['member_gender'].value_counts().index, y=df['member_gender'].value_counts().values, palette='viridis')
for i, v in enumerate(df['member_gender'].value_counts().values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
plt.title('Bar Chart of Member Gender Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Number of Riders', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
1. Why did you pick the specific chart?¶
A bar plot is ideal as it helps to identify gender participation using a simple and clear comparison.
2. What is/are the insight(s) found from the chart?¶
Males dominate ridership, with females underrepresented.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, it highlights opportunity to target female riders through safety or awareness campaigns.
Negative Insight - Gender imbalance could be a sign of safety or comfort issues for female riders.
Chart - 4 - Member Age Distribution¶
# Chart - 4 visualization code
# Member Age Distribution
# Create histogram
fig = px.histogram(df, x='age', nbins=40,
title='Age Distribution of Members',
labels={'age': 'Age'})
fig.update_traces(marker_color='slateblue')
fig.update_layout(bargap=0.05)
# Add Title
fig.update_layout(title_text='Age Distribution of Members', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
Histograms show spread and concentration of rider age.
2. What is/are the insight(s) found from the chart?¶
Most riders are between 25-40 years old, suggesting a young working demographic.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, helps tailor services (e.g., mobile apps, offers) to a tech-savvy, commuter audience.
Negative Insight - May indicate under-engagement of seniors, not critical but room for inclusion.
Chart - 5 - Start Hour Distribution¶
# Chart - 5 visualization code
# Start Hour Distribution
hourly = df['start_hour'].value_counts().sort_index()
# Create bar chart
fig = px.bar(x=hourly.index, y=hourly.values,
labels={'x': 'Hour of Day', 'y': 'Number of Trips'},
title='Number of Trips by Hour of Day')
fig.update_traces(marker_color='teal', text=hourly.values, textposition='outside')
fig.update_layout(xaxis=dict(dtick=1), uniformtext_minsize=8)
# Add Title
fig.update_layout(title_text='Number of Trips by Hour of Day', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
Shows peak demand hours and usage patterns.
2. What is/are the insight(s) found from the chart?¶
Peak usage occurs during 8 AM and 5-6 PM, confirming commute-driven behavior.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, helps optimize fleet positioning and maintenance outside peak times.
Negative Insight - No, but over-reliance on rush hours can risk under-utilization during other hours.
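Peak hours can also be read off the hourly counts programmatically instead of from the chart. The sketch below does this on made-up data (`toy` is hypothetical); it mirrors the `hourly` series built in the chart code.

```python
import pandas as pd

# Toy start hours (hypothetical values)
toy = pd.DataFrame({"start_hour": [8, 8, 8, 17, 17, 17, 17, 9, 12, 18, 18, 23]})

# Trips per hour, ordered by hour of day
hourly = toy["start_hour"].value_counts().sort_index()

# The three busiest hours and the share of all trips they account for
top_hours = hourly.nlargest(3)
peak_share = top_hours.sum() / hourly.sum()

print(top_hours)
print(f"Top 3 hours carry {peak_share:.0%} of trips")
```

The share carried by the busiest hours gives a rough measure of how concentrated demand is around the commute.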
Chart - 6 - Start Day of Week Distribution¶
# Chart - 6 visualization code
# Start Day of Week Distribution
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Create bar plot
fig = px.bar(df['start_day'].value_counts().reindex(day_order).reset_index(),
x='start_day', y='count',
labels={'start_day': 'Day of Week', 'count': 'Number of Trips'},
title='Number of Trips by Day of Week')
fig.update_traces(marker_color='orange', textposition='outside', text=df['start_day'].value_counts().reindex(day_order).values)
fig.update_layout(uniformtext_minsize=8)
# Add Title
fig.update_layout(title_text='Number of Trips by Day of Week', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A bar chart effectively shows the distribution of rides across each day, making it easy to identify peak days and behavioral patterns.
2. What is/are the insight(s) found from the chart?¶
Most rides happen on weekdays, especially Tuesday to Thursday, indicating heavy use for weekday commuting.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, it informs operational planning—more bikes and support staff can be allocated during peak weekday hours.
Negative Insight - Lower usage on weekends may indicate missed leisure ride opportunities. It’s not a negative, but a business growth area.
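Total trips per weekday name depend on how many Mondays, Tuesdays, and so on fall inside the data window. A fairer comparison is the average number of trips per calendar day, split into weekday and weekend. The sketch below shows one way to compute it on made-up timestamps (`toy` and the `day_type` column are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy start times (hypothetical): Fri 26, Sat 27, Sun 28, Mon 29 January 2018
toy = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-01-26 08:00", "2018-01-26 17:30", "2018-01-26 18:00",
    "2018-01-27 11:00",
    "2018-01-28 14:00", "2018-01-28 15:00",
    "2018-01-29 08:15", "2018-01-29 08:45", "2018-01-29 17:10", "2018-01-29 18:20",
])})

# Count trips per calendar day (normalize() drops the time of day)
daily = (
    toy.groupby(toy["start_time"].dt.normalize())
    .size()
    .rename("trips")
    .reset_index()
)

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
daily["day_type"] = np.where(daily["start_time"].dt.dayofweek >= 5, "weekend", "weekday")

# Average trips per day within each group
avg_per_day = daily.groupby("day_type")["trips"].mean()
print(avg_per_day)
```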
Chart - 7 - Top 10 Start Stations¶
# Chart - 7 visualization code
# Top 10 Start Stations
top_starts = df['start_station_name'].value_counts().nlargest(10)
# Create bar plot
fig = px.bar(top_starts, x=top_starts.values, y=top_starts.index,
labels={'x': 'Number of Trips', 'y': 'Start Station'},
title='Top 10 Start Stations')
fig.update_traces(marker_color='seagreen', text=top_starts.values, textposition='outside')
fig.update_layout(uniformtext_minsize=8, xaxis_tickangle=-45)
# Add Title
fig.update_layout(title_text='Top 10 Start Stations', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A horizontal bar chart displays the top-performing stations clearly, even with long names.
2. What is/are the insight(s) found from the chart?¶
Certain stations dominate as popular starting points—often located in commercial or transit-dense areas.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, these stations can be prioritized for bike availability, rebalancing, and marketing.
Negative Insight - Over-reliance on a few stations might strain infrastructure; expansion of underutilized stations could balance usage.
Chart - 8 - Top 10 End Stations¶
# Chart - 8 visualization code
# Top 10 End Stations
top_ends = df['end_station_name'].value_counts().nlargest(10)
# Create bar plot
fig = px.bar(top_ends, x=top_ends.values, y=top_ends.index,
labels={'x': 'Number of Trips', 'y': 'End Station'},
title='Top 10 End Stations')
fig.update_traces(marker_color='crimson', text=top_ends.values, textposition='outside')
fig.update_layout(uniformtext_minsize=8, xaxis_tickangle=-45)
# Add Title
fig.update_layout(title_text='Top 10 End Stations', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
Like the previous chart, it highlights most common destinations using a clean, comparative format.
2. What is/are the insight(s) found from the chart?¶
Popular drop-off points align closely with start stations, suggesting frequent round-trips or commuter corridors.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes, end-station data helps optimize bike docking, availability, and customer satisfaction.
Negative Insight - If end stations are overloaded compared to start stations, it could cause operational imbalance (bike shortages).
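The imbalance concern can be checked by comparing arrivals and departures per station. The sketch below computes a net flow (trips ending minus trips starting) on made-up data; the station names "A", "B", "C" and the frame `toy` are hypothetical.

```python
import pandas as pd

# Toy trips between three hypothetical stations
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "C"],
    "end_station_name":   ["B", "B", "C", "A", "B"],
})

starts = toy["start_station_name"].value_counts()
ends = toy["end_station_name"].value_counts()

# Positive: bikes accumulate at the station; negative: the station drains
# fill_value=0 covers stations that appear only as a start or only as an end
net_flow = ends.sub(starts, fill_value=0).astype(int).sort_values()

print(net_flow)
```

Stations at either end of the sorted result would be the first candidates for rebalancing.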
Bivariate Analysis¶
Chart - 9 - Trips by Day of Week (Split by User Type)¶
# Chart - 9 visualization code
# Trips by Day of Week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_user = df.groupby(['start_day', 'user_type']).size().unstack().reindex(day_order)
# Create line plot
fig = go.Figure()
for user in day_user.columns:
    fig.add_trace(go.Scatter(x=day_order, y=day_user[user],
                             mode='lines+markers',
                             name=user))
# Add Title
fig.update_layout(title='Trips by Day of Week by User Type',
xaxis_title='Day of Week',
yaxis_title='Number of Trips',
legend_title='User Type')
# Set title size and weight
fig.update_layout(title_text='Trips by Day of Week by User Type', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Set border to legend
fig.update_layout(legend=dict(bordercolor='black', borderwidth=1))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A line chart is ideal for visualizing trends over a categorical sequence like days of the week. Splitting by user type (e.g., Subscriber vs Customer) makes it easy to compare usage patterns across user segments.
2. What is/are the insight(s) found from the chart?¶
Subscribers tend to use bikes heavily on weekdays (especially Tuesday to Thursday), suggesting regular commuting behavior.
Customers show more activity on weekends, indicating casual or recreational use.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes, it allows Ford GoBike to tailor their services:
- Prioritize bike availability and support staff during weekdays for subscribers.
- Plan promotions, events, or tourist-targeted offers for weekends to attract customers.
Negative Insight -
The relatively low weekday engagement by customers could indicate a missed opportunity. By understanding this gap, the business can target strategies to encourage casual weekday riders (e.g., discounts or partnerships with local businesses).
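Because Subscribers account for far more trips than Customers, raw counts can hide the Customer pattern. One option is to plot each user type's share of its own trips per day. The sketch below computes those shares with a column-normalized crosstab on made-up data (`toy` is hypothetical and uses only two days for brevity).

```python
import pandas as pd

# Toy trips (hypothetical values)
toy = pd.DataFrame({
    "start_day": ["Monday"] * 5 + ["Saturday"] * 3,
    "user_type": ["Subscriber"] * 4 + ["Customer"] + ["Subscriber"] + ["Customer"] * 2,
})

day_order = ["Monday", "Saturday"]

# normalize="columns": each user type's column sums to 1 across days
share = pd.crosstab(
    toy["start_day"], toy["user_type"], normalize="columns"
).reindex(day_order)

print(share.round(2))
```

With shares, the two lines sit on the same scale, so a weekend tilt among Customers is visible even when their absolute counts are small.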
Chart - 10 - Trip Duration by User Type (Box Plot)¶
# Chart - 10 visualization code
# Trip Duration by User Type
# Create box plot
fig = px.box(df, x='user_type', y='duration_min',
title='Trip Duration by User Type',
labels={'duration_min': 'Trip Duration (minutes)', 'user_type': 'User Type'})
# Add Title
fig.update_layout(title_text='Trip Duration by User Type', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A box plot is ideal for visualizing the distribution, spread, and outliers in trip durations across user types. It highlights medians, quartiles, and variability at a glance.
2. What is/are the insight(s) found from the chart?¶
Customers tend to have longer and more variable trip durations compared to Subscribers, whose trips are shorter and more consistent.
Customers show more extreme outliers, possibly indicating exploratory or recreational rides.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact - Yes. It gives clear signals about user behavior:
- Subscribers likely use the service for short, regular commutes.
- Customers may benefit from targeted plans that accommodate longer, flexible trips—e.g., daily passes or weekend bundles.
Negative Insight - Not necessarily negative, but the high variability and outliers in customer trips may imply irregular usage patterns. If not planned for (e.g., ensuring bike availability for extended durations), it could strain inventory and impact customer experience.
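The medians and spreads that the box plot shows can be tabulated so they can be quoted in a report. The sketch below computes quartiles and the interquartile range per user type on made-up data (`toy` and the column labels `q1`, `median`, `q3`, `iqr` are hypothetical).

```python
import pandas as pd

# Toy durations in minutes per user type (hypothetical values)
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "duration_min": [5.0, 7.0, 9.0, 11.0, 10.0, 20.0, 30.0, 60.0],
})

# Quartiles per group; unstack turns the quantile level into columns
q = toy.groupby("user_type")["duration_min"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]

# Interquartile range: a spread measure that is not driven by extreme values
q["iqr"] = q["q3"] - q["q1"]

print(q)
```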
Chart - 11 - Trip Count by Hour (Split by Gender)¶
# Chart - 11 visualization code
# Trip Count by Hour (Split by Gender)
hour_gender = df.groupby(['start_hour', 'member_gender']).size().unstack()
# Create bar plot
fig = px.bar(hour_gender,
x=hour_gender.index,
y=hour_gender.columns,
title='Trip Count by Hour of Day and Gender',
labels={'value': 'Number of Trips', 'start_hour': 'Hour of Day'})
fig.update_layout(barmode='group', xaxis=dict(dtick=1), legend_title='Gender')
# Add Title
fig.update_layout(title_text='Trip Count by Hour of Day and Gender', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A bar chart provides a clear, comparative view of ride frequency across each hour of the day. Splitting it by gender adds a valuable demographic dimension to understand who rides when.
2. What is/are the insight(s) found from the chart?¶
Peak trip counts are observed during morning (7–9 AM) and evening (4–6 PM), indicating commute patterns.
Male users dominate the rides across most hours, especially during commuting times.
Female and other gender categories show a smaller, more evenly spread riding pattern.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes, these patterns can inform:
- Operational decisions like when to rebalance bikes.
- Gender-targeted marketing (e.g., safety awareness or incentives for female riders during off-peak hours).
- Infrastructure planning to support commuter-heavy hours.
Negative Insight -
The gender imbalance (significantly fewer female riders) may point to safety concerns or a lack of targeted outreach. Addressing this could expand user diversity and grow ridership.
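To see whether the gender mix itself shifts during the day, rather than just the totals, the counts can be normalized within each hour. The sketch below does this with a row-normalized crosstab on made-up data (`toy` is hypothetical).

```python
import pandas as pd

# Toy trips (hypothetical values)
toy = pd.DataFrame({
    "start_hour": [8, 8, 8, 8, 13, 13, 17, 17, 17, 17],
    "member_gender": ["Male", "Male", "Male", "Female", "Male",
                      "Female", "Male", "Male", "Female", "Other"],
})

# normalize="index": each hour's row sums to 1 across genders
gender_share = pd.crosstab(
    toy["start_hour"], toy["member_gender"], normalize="index"
)

print(gender_share.round(2))
```

An hour-by-hour share makes it possible to check whether female ridership is proportionally lower at particular times, which raw counts cannot show.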
Chart - 12 - Age by User Type¶
# Chart - 12 visualization code
# Age by User Type
# Create box plot
fig = px.box(df, x='user_type', y='age',
title='Age Distribution by User Type',
labels={'age': 'Age', 'user_type': 'User Type'})
# Add Title
fig.update_layout(title_text='Age Distribution by User Type', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A box plot is effective for comparing age distributions across user types. It reveals medians, ranges, and outliers, making it easy to spot demographic trends and differences.
2. What is/are the insight(s) found from the chart?¶
Subscribers generally fall into a narrower and younger age range (mostly mid-20s to mid-40s), indicating a strong working professional base.
Customers show a wider age range, including older users, suggesting more occasional or leisure-based use.
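The "narrower" versus "wider" age ranges read off the box plot can be backed with numbers by computing quartiles per user type. This is a rough sketch, assuming the `user_type` and `age` columns from the chart code; `age_summary` and the toy values are hypothetical.

```python
import pandas as pd

def age_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Quartiles and interquartile range (IQR) of rider age per user type."""
    q = (
        trips.groupby("user_type")["age"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()
    )
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]  # width of the box in the box plot
    return q

# Toy data for illustration only
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 4,
    "age": [25, 30, 35, 40, 45, 18, 30, 50, 70],
})
summary = age_summary(toy)
print(summary)
```

A larger `iqr` for one group corresponds to a taller box, which is the quantity the insight above describes qualitatively.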
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes. This insight enables:
- Tailored marketing—focus professional commuting plans toward the dominant subscriber age group.
- Diversification strategies—design campaigns or services attractive to older or casual riders (e.g., guided tours, senior discounts).
Negative Insight -
The lack of younger or older subscribers might hint at unmet needs or barriers (e.g., pricing or accessibility). Addressing these could unlock new user segments and expand the service's reach.
Chart - 13 - Trip Duration by Gender¶
# Chart - 13 visualization code
# Trip Duration by Gender
# Create Violin Plot
fig = px.violin(df, x='member_gender', y='duration_min',
title='Trip Duration by Gender',
labels={'duration_min': 'Trip Duration (minutes)', 'member_gender': 'Gender'})
# Add Title
fig.update_layout(title_text='Trip Duration by Gender', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A violin plot combines the benefits of a box plot and a kernel density plot. It not only shows the spread and median of trip durations but also gives a visual sense of the distribution shape for each gender category. This helps in spotting asymmetry or concentration of trip durations.
2. What is/are the insight(s) found from the chart?¶
Male and female riders have fairly similar median trip durations, but female riders show slightly more variation.
The "Other"/Unspecified gender group shows more dispersed durations and noticeable outliers, suggesting less consistent trip behavior or possibly a smaller sample size.
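Whether the dispersion in the "Other" group reflects behavior or simply a small sample can be checked by listing the group sizes next to the spread. The sketch below assumes the `member_gender` and `duration_min` columns from the chart code; the function name and the toy values are hypothetical.

```python
import pandas as pd

def duration_stats_by_gender(trips: pd.DataFrame) -> pd.DataFrame:
    """Trip count, median, and standard deviation of duration per gender."""
    return trips.groupby("member_gender")["duration_min"].agg(
        n="count", median="median", std="std"
    )

# Toy data for illustration only
toy = pd.DataFrame({
    "member_gender": ["Male"] * 4 + ["Female"] * 3 + ["Other"] * 2,
    "duration_min": [10, 12, 14, 16, 9, 12, 15, 5, 40],
})
stats = duration_stats_by_gender(toy)
print(stats)
```

If `n` for a group is very small relative to the others, a wide violin for that group should be read with caution.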
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes. Understanding trip behavior by gender helps the company:
- Refine service offerings or communication strategies to better align with how different genders use the service.
- Recognize patterns that might indicate barriers or unique usage needs, which could be addressed to improve user experience.
Negative Insight -
The wider variability in trip durations for the "Other" category could indicate less predictability in how this group uses the system. This might point to a lack of tailored service offerings or accessibility gaps that could be limiting consistent usage.
Chart - 14 - Average Trip Duration by Start Day of Week¶
# Chart - 14 visualization code
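# Average Trip Duration by Start Day of Week
# Fixed weekday order so the bars run Monday to Sunday (same list as in Chart 16)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_duration = df.groupby('start_day')['duration_min'].mean().reindex(day_order)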
# Create bar plot
fig = px.bar(x=avg_duration.index, y=avg_duration.values,
labels={'x': 'Day of Week', 'y': 'Average Duration (minutes)'},
title='Average Trip Duration by Day of Week')
fig.update_traces(marker_color='darkorange', text=avg_duration.values.round(1), textposition='outside')
fig.update_layout(uniformtext_minsize=8)
# Add title
fig.update_layout(title_text='Average Trip Duration by Day of Week', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A bar chart is ideal for comparing average values across categorical variables. In this case, it clearly shows how trip duration varies from Monday to Sunday, helping uncover behavioral trends tied to the day of the week.
2. What is/are the insight(s) found from the chart?¶
Weekends (Saturday and Sunday) show higher average trip durations compared to weekdays, indicating more leisure or exploratory rides.
Weekdays, especially Tuesday to Thursday, have shorter average durations, likely reflecting structured commute patterns.
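Because a mean is sensitive to a few very long rides, the weekend effect is worth confirming with the median as well. The following sketch assumes `start_day` holds weekday names and `duration_min` holds minutes, as in the chart code; `weekend_vs_weekday` and the toy data are hypothetical.

```python
import pandas as pd

WEEKEND_DAYS = ["Saturday", "Sunday"]

def weekend_vs_weekday(trips: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, and trip count of duration for weekend vs. weekday trips."""
    day_type = (
        trips["start_day"]
        .isin(WEEKEND_DAYS)
        .map({True: "Weekend", False: "Weekday"})
    )
    return trips.groupby(day_type)["duration_min"].agg(["mean", "median", "count"])

# Toy data for illustration only; the 100-minute ride inflates the weekend mean
toy = pd.DataFrame({
    "start_day": ["Monday", "Tuesday", "Wednesday", "Saturday", "Sunday", "Sunday"],
    "duration_min": [10, 8, 9, 20, 30, 100],
})
result = weekend_vs_weekday(toy)
print(result)
```

If the weekend median is also higher than the weekday median, the pattern is not just an artifact of outliers.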
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes. These insights support operational and marketing decisions:
- Weekday focus on efficiency and quick bike turnover for commuters.
- Weekend offerings like all-day passes or event partnerships targeting tourists and casual riders.
Negative Insight -
Longer average durations on weekends could lead to bike unavailability or shortages if inventory isn’t adjusted. This insight helps preemptively address such issues by improving redistribution strategies and resource planning.
Chart - 15 - Average Trip Duration by Start Hour¶
# Chart - 15 visualization code
# Average Trip Duration by Start Hour
# Calculate average trip duration by hour
avg_duration_by_hour = df.groupby('start_hour')['duration_sec'].mean().reset_index()
# Convert duration from seconds to minutes for easier understanding
avg_duration_by_hour['duration_min'] = avg_duration_by_hour['duration_sec'] / 60
# Plot using Plotly
fig = px.line(avg_duration_by_hour,
x='start_hour',
y='duration_min',
title='Average Trip Duration by Start Hour',
labels={'start_hour': 'Start Hour of Day', 'duration_min': 'Average Duration (minutes)'},
markers=True)
# Customize layout
fig.update_traces(mode="lines+markers", line=dict(color='royalblue', width=3))
fig.update_layout(title_x=0.5,
xaxis=dict(dtick=1),
yaxis=dict(tickformat=".1f"),
title_font=dict(size=18, family='Arial', color='black', weight='bold'),
hovermode='x unified')
# Show chart
fig.show()
1. Why did you pick the specific chart?¶
A line chart is great for visualizing continuous trends—here, it helps observe how average trip duration changes across 24 hours. Plotting it over time (hourly) makes it easy to spot peaks and drops throughout the day.
2. What is/are the insight(s) found from the chart?¶
Average trip duration tends to peak early morning (around 5-7 AM) and late at night (after 9 PM), likely due to leisure or non-commuting trips.
Midday to early evening (10 AM - 7 PM) shows more consistent, slightly shorter trip durations, likely influenced by work or errand-based usage.
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes. The company can:
- Plan dynamic pricing or promotions during low-usage hours.
- Support fleet rebalancing efforts by knowing when longer trips are likely.
- Design personalized offers based on user trip patterns (e.g., incentives for off-peak travel).
Negative Insight -
Possibly. If long trips during odd hours go unaddressed (e.g., no rebalancing or bike availability), it may lead to stockouts or delays, harming user satisfaction. These need to be monitored and managed proactively.
Multivariate Analysis¶
Chart - 16 - User Type by Gender and Day of Week¶
# Chart - 16 visualization code
multi_df = df.groupby(['start_day', 'member_gender', 'user_type']).size().reset_index(name='count')
multi_df = multi_df[multi_df['member_gender'].notna()]
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
multi_df['start_day'] = pd.Categorical(multi_df['start_day'], categories=day_order, ordered=True)
# Create grouped bar plot
fig = px.bar(multi_df, x='start_day', y='count', color='user_type', barmode='group',
facet_col='member_gender', category_orders={'start_day': day_order},
title='User Type by Gender and Day of Week',
labels={'count': 'Number of Trips', 'start_day': 'Day of Week'})
fig.update_layout(legend_title='User Type')
# Add title
fig.update_layout(title_text='User Type by Gender and Day of Week', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Add legend border and title
fig.update_layout(legend=dict(bordercolor='black', borderwidth=1))
fig.update_layout(legend_title_text='User Type')
# Show plot
fig.show()
1. Why did you pick the specific chart?¶
A grouped bar plot is ideal for comparing multiple categorical variables. In this case, it visually contrasts user types (Customer vs. Subscriber) across gender and day of the week. This format makes it easier to identify usage trends and disparities between groups over time.
2. What is/are the insight(s) found from the chart?¶
Subscribers dominate across all genders, especially on weekdays, confirming that subscribers are more likely commuters.
Customers show increased activity on weekends, particularly among female riders, indicating more recreational or flexible use.
Male riders are the most active across both user types, but the gender gap slightly narrows during the weekend.
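Since raw counts are dominated by the largest groups, the weekend shift toward customers is easier to see as a share within each gender. This is a minimal sketch, assuming the `start_day`, `member_gender`, and `user_type` columns from the chart code; the helper name, the `day_type` label, and the toy rows are hypothetical.

```python
import pandas as pd

WEEKEND_DAYS = ["Saturday", "Sunday"]

def user_type_share(trips: pd.DataFrame) -> pd.DataFrame:
    """Share of each user type within every (gender, weekday/weekend) group."""
    day_type = (
        trips["start_day"]
        .isin(WEEKEND_DAYS)
        .map({True: "Weekend", False: "Weekday"})
        .rename("day_type")
    )
    # normalize="index" makes each row sum to 1
    return pd.crosstab(
        [trips["member_gender"], day_type], trips["user_type"], normalize="index"
    )

# Toy data for illustration only
toy = pd.DataFrame({
    "member_gender": ["Female"] * 4 + ["Male"] * 5,
    "start_day": ["Saturday", "Saturday", "Monday", "Tuesday",
                  "Monday", "Tuesday", "Sunday", "Sunday", "Sunday"],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Subscriber",
                  "Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber"],
})
share = user_type_share(toy)
print(share.round(2))
```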
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes. This visualization can:
- Guide targeted marketing strategies, e.g., promoting weekend passes or tourist packages to customer-type users, especially female riders.
- Inform operational planning: adjust bike availability and maintenance cycles based on expected user load by type and gender.
- Help build personalized communication, such as offering subscription incentives to frequent female weekend riders.
Negative Insight -
Yes, potentially:
- The gender imbalance (lower female participation during weekdays) may indicate barriers like safety, route awareness, or convenience.
- Overdependence on subscribers for weekday usage may lead to vulnerability if subscriber numbers drop due to alternative commuting options.
Chart - 17 - Age vs. Trip Duration by Gender & User Type¶
# Chart - 17 visualization code
# Age vs. Trip Duration by Gender & User Type
filtered_df = df[(df['age'] > 0) & (df['duration_min'] < 120)] # Filter out outliers for clarity
# Create scatter plot
fig = px.scatter(filtered_df, x='age', y='duration_min',
color='member_gender', facet_col='user_type',
title='Trip Duration vs Age by User Type and Gender',
labels={'duration_min': 'Trip Duration (minutes)', 'age': 'Age'})
fig.update_traces(marker=dict(size=5, opacity=0.6))
# Add title
fig.update_layout(title_text='Trip Duration vs. Age by User Type and Gender', title_x=0.5)
fig.update_layout(title_font=dict(size=18, family='Arial', color='black', weight='bold'))
# Add legend border and title
fig.update_layout(legend=dict(bordercolor='black', borderwidth=1))
fig.update_layout(legend_title_text='Gender')
# Show Plot
fig.show()
1. Why did you pick the specific chart?¶
A scatter plot is best for visualizing relationships between two continuous variables—here, age and trip duration. Coloring by gender and splitting by user type adds a powerful dimension for comparative analysis. This helps reveal patterns, clusters, or outliers across demographics and usage types.
2. What is/are the insight(s) found from the chart?¶
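Longer trips appear more concentrated among younger riders, especially subscribers, suggesting younger users take longer, possibly more exploratory trips. The effect is visual and modest: the correlation heatmap (Chart 18) reports only a weak linear relationship between age and duration.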
Customers show more variation in duration across all ages, indicating less structured ride behavior.
Female riders tend to cluster in the mid-age range with moderate durations, while male riders dominate across broader age and duration ranges.
Outliers are present, especially among customers—some users (regardless of gender) have very long trips, possibly tourists or infrequent riders.
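The outliers mentioned above can be flagged explicitly rather than judged by eye. The sketch below uses the common 1.5 × IQR rule within each user type; this rule is a swapped-in convention, not something the original notebook applies, and `flag_long_trips` and the toy durations are hypothetical.

```python
import pandas as pd

def flag_long_trips(trips: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Boolean mask of trips longer than Q3 + k * IQR within their user type."""
    grouped = trips.groupby("user_type")["duration_min"]
    # transform() returns per-group quartiles aligned to the original rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    return trips["duration_min"] > q3 + k * (q3 - q1)

# Toy data for illustration only
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 5,
    "duration_min": [8, 9, 10, 11, 12, 10, 15, 20, 25, 300],
})
flags = flag_long_trips(toy)
print(toy[flags])
```

Flagged rows could then be inspected by station or date to judge whether they look like tourist rides or like bikes that were not docked properly.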
3. Will the gained insights help creating a positive business impact?¶
Are there any insights that lead to negative growth? Justify with specific reason.
Positive Impact -
Yes.
- Helps the company tailor marketing and promotions: e.g., adventure packages for younger riders or short-trip commuter deals for older subscribers.
- Enables user experience improvements by designing better ride plans or app features suited to common age-duration patterns.
- Reveals potential new user segments like young female customers who may be underrepresented and can be engaged more actively.
Negative Insight -
Yes, potentially:
- The lack of older customers with longer trips may signal that current offerings don't appeal to or accommodate them. This could be a missed revenue opportunity.
- If outliers with very long durations represent lost or misused bikes, it could lead to asset misuse or availability issues, affecting overall service efficiency.
Chart - 18 - Correlation Heatmap¶
# Correlation Heatmap visualization code
# Selecting only numeric columns for correlation
numeric_cols = df[['duration_min', 'age', 'start_hour']]
# Correlation matrix
corr_matrix = numeric_cols.corr()
# Plot using seaborn
plt.figure(figsize=(14, 5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f') # Create heatmap
# Add title
plt.title('Correlation Heatmap', fontweight = 'bold', fontsize = 16)
# Show plot
plt.show()
1. Why did you pick the specific chart?¶
A correlation heatmap is chosen to visualize the strength and direction of linear relationships between multiple numerical variables. It gives an immediate overview of which features are potentially influencing each other, and how strong that influence is.
2. What is/are the insight(s) found from the chart?¶
duration_min vs age: Weak positive correlation. Age has little linear association with trip duration.
duration_min vs start_hour: Near-zero linear correlation. Trip duration shows no straight-line trend across the day, although the hourly averages in Chart 15 still rise and fall, a pattern a linear coefficient cannot capture.
age vs start_hour: Slight negative correlation. Older users tend to start trips marginally earlier in the day, but the effect is very small.
Overall, none of the numerical variables show any strong linear relationship with each other.
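A near-zero Pearson coefficient only rules out a straight-line relationship. The toy example below illustrates this under made-up data: a U-shaped variable has zero linear correlation with `x` despite an obvious pattern, while a monotonic but curved variable reaches a rank correlation of 1. Spearman correlation is computed here as Pearson on ranks, and the column names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "u_shaped": [4.0, 1.0, 0.0, 1.0, 4.0],        # (x - 3) ** 2
    "monotonic": [1.0, 8.0, 27.0, 64.0, 125.0],   # x ** 3
})

pearson = toy.corr()           # linear association
spearman = toy.rank().corr()   # Pearson on ranks = Spearman rank correlation

print(pearson.round(2))
print(spearman.round(2))
```

Applied to the notebook's `numeric_cols`, comparing the two matrices would indicate whether any monotonic but non-linear relationship is hidden behind the weak Pearson values.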
Chart - 19 - Pair Plot¶
# Pair Plot visualization code
# Sample the dataset to keep the pairplot fast and readable
sample_df = df[['duration_min', 'age', 'start_hour', 'member_gender']].dropna().sample(1000, random_state=42)
# pairplot creates its own figure; panel size is controlled through `height`
# Color points by gender and draw KDE curves on the diagonal
sns.pairplot(sample_df, hue='member_gender', diag_kind='kde', palette='husl', height=2.5)
# Add title
plt.suptitle('Pairplot of Duration, Age, and Start Hour by Gender', y=1.02, fontweight = 'bold', fontsize = 16)
# Show plot
plt.show()
1. Why did you pick the specific chart?¶
A pairplot is ideal when exploring pairwise relationships between multiple numerical features, especially when segmented by a categorical feature like gender. It allows us to visually assess distributions and potential correlations in a compact matrix form, while also comparing patterns across genders.
2. What is/are the insight(s) found from the chart?¶
Trip Duration vs Age: Both males and females show a wide spread across age groups for trip durations, but no visible strong trend — suggesting duration is not strongly age-dependent.
Start Hour vs Age: No clear pattern, indicating trip start times are spread fairly evenly across age groups.
Duration vs Start Hour: There’s a slight concentration of shorter trips during peak hours for both genders, but not significantly different between them.
Distribution Differences: The distribution of age and start hour is fairly consistent across genders, but trip duration appears slightly more varied for males.
5. Solution to Business Objective¶
What do you suggest the client do to achieve the business objective?¶
Explain Briefly.
- Focus on Subscriber Retention & Growth :
Since Subscribers form the majority of the user base, and they consistently show higher ride frequency and shorter trip durations (indicating utility usage), prioritize:
- Retention strategies: Loyalty programs, renewal discounts, referral incentives.
- Conversion of casual customers to subscribers through trial passes and promotions.
- Optimize Bike Availability by Time and Location :
The analysis showed peak usage during morning and evening commute hours and specific high-traffic stations. Use this to:
- Strategically rebalance bikes across top start/end stations.
- Implement dynamic reallocation schedules aligned with hourly usage trends.
- Enhance Targeting Based on Gender & Age Segments :
Though trip patterns across gender and age didn't show strong variation, segment-level marketing can still be valuable:
- Offer youth-focused campaigns and student discounts (as younger users have relatively higher usage).
- Develop safety or comfort features that appeal to female riders, encouraging more balanced usage.
- Improve Trip Experience Through App Features :
Since trip durations are generally short and predictable:
- Use in-app tools to recommend start/end stations based on user location and availability.
- Provide real-time updates on station capacity to avoid frustration and missed trips.
- Seasonal Promotions and Off-Peak Incentives :
Usage is lower in certain months or non-peak hours:
- Introduce seasonal offers or “Happy Hour” discounts to encourage off-peak usage.
- Partner with local events or businesses to bundle ride offers during less busy times.
- Monitor Underperforming Stations :
Identify stations with consistently low usage and:
- Investigate accessibility issues.
- Consider relocating or marketing those locations better.
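The rebalancing and station-monitoring recommendations above could start from a per-station net flow: departures minus arrivals. This is a rough sketch only. It assumes the trip table has `start_station_name` and `end_station_name` columns, which are not shown in this section, and `net_outflow` and the toy trips are hypothetical.

```python
import pandas as pd

def net_outflow(trips: pd.DataFrame,
                start_col: str = "start_station_name",
                end_col: str = "end_station_name") -> pd.Series:
    """Departures minus arrivals per station; positive means bikes drain away."""
    departures = trips[start_col].value_counts()
    arrivals = trips[end_col].value_counts()
    # fill_value=0 keeps stations that appear only as a start or only as an end
    return departures.sub(arrivals, fill_value=0).sort_values(ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "C"],
    "end_station_name":   ["B", "B", "C", "A", "B"],
})
flow = net_outflow(toy)
print(flow)
```

Stations with a large positive value tend to run out of bikes and stations with a large negative value tend to fill up, so both ends of the ranking are candidates for rebalancing. Computing the same series per hour would align it with the commute peaks noted earlier.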
Conclusion¶
The Ford GoBike analysis uncovered clear patterns in user behavior and trip usage. Subscribers are the primary users, mostly riding during weekday peak hours for commuting, while customers prefer weekends and longer rides, indicating recreational use.
Peak usage occurs during rush hours, and certain stations consistently experience high demand, guiding better bike distribution strategies. Young male adults (25-35) dominate ridership, suggesting a target demographic for marketing efforts.
While correlations between numeric variables were minimal, categorical insights such as user type, day of week, and gender provided stronger patterns for business action.
These insights can help Ford GoBike enhance operations, increase customer satisfaction, and grow ridership by optimizing availability, targeting promotions, and improving service in high-demand areas.